Week 6
Alex Cardazzi
Old Dominion University
Estimating Models (intercept, trend, kink, AR, MA)
Visualization
Interpretation / Inference
\[Y = f(X, \epsilon)\]
If we want to model some Data Generating Process (DGP), we need the data we use to train/estimate the model to “represent” the data in the future. We need to know that something in the future will be similar to how it was in the past.
To test for stationarity, we have a few options.
Visual Inspection
ACF, PACF
Unit Root test (augmented Dickey-Fuller, KPSS, etc.)
\(Y_{t} = \beta Y_{t-1} + e_t\)
\(Y_{t} - Y_{t-1} = \beta Y_{t-1} + e_t - Y_{t-1}\)
\(\Delta Y_{t} = (\beta-1)Y_{t-1} + e_t\)
Test if \((\beta - 1)\) is different from \(0\).
\((\beta - 1) \neq 0 \implies \beta \neq 1 \implies\) no unit root (i.e. stationarity)
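For intuition, here is a minimal sketch (assuming the `tseries` package is installed) comparing the augmented Dickey-Fuller test on a simulated random walk versus a stationary AR(1):

```r
# Hedged sketch: ADF test on a unit-root series vs. a stationary one.
# Assumes the `tseries` package is installed.
library(tseries)

set.seed(1)
n <- 500
rw <- cumsum(rnorm(n))                           # random walk: beta = 1 (unit root)
ar1 <- arima.sim(model = list(ar = 0.5), n = n)  # stationary AR(1)

adf.test(rw)   # typically fails to reject the unit root (large p-value)
adf.test(ar1)  # typically rejects the unit root (small p-value)
```

Note that `adf.test()` truncates reported p-values at 0.01 and 0.99, so a strong rejection prints as "p-value smaller than printed p-value."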
Okay, our time series is non-stationary. What now?
We are still having trouble achieving stationarity with this time series because of its seasonal component. If the seasonality were (magically / mathematically) removed, though, the series would appear to have a constant mean and variance.
Let’s try another time series: Monthly Liquor Sales from FRED.
Log Seasonally Adjusted Sales Data (Difference)
\(Y = \alpha + \beta X + e\)
\(Y = \alpha + \beta \text{log}(X) + e\)
\(\text{log}(Y) = \alpha + \beta X + e\)
\(\text{log}(Y) = \alpha + \beta \text{log}(X) + e\)
We want to forecast liquor sales. However, this time series in levels is non-stationary. Instead, we’ll estimate growth rates in liquor sales and then transform back into sales. Throughout the process, visualize your transformation and test if it is stationary.
\[ \begin{align} log(Y_{t}) - log(Y_{t-1}) &= \Delta log(Y_{t}) \\ &= \alpha + \beta \Delta log(Y_{t-1}) + \epsilon_t \end{align} \]
If \(Y_{t-1}\) increased by 1% from \(Y_{t-2}\), we can expect a \(100 \times \beta\)% change from \(Y_{t-1}\) to \(Y_{t}\).
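The log-difference approximation to growth rates that this interpretation relies on can be checked in a couple of lines (a toy series, not the liquor data):

```r
# Sketch: log differences approximate period-over-period growth rates.
y <- c(100, 101, 103.02)         # grows by exactly 1%, then 2%
growth <- diff(y) / head(y, -1)  # exact growth rates: 0.0100, 0.0200
dlog <- diff(log(y))             # log differences: ~0.00995, ~0.01980
round(cbind(growth, dlog), 4)
```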
To make this concrete, let's fit and interpret a trend model.
Call:
lm(formula = log(sales_adj) ~ t, data = liquor)
Residuals:
Min 1Q Median 3Q Max
-0.062226 -0.015790 -0.004704 0.016031 0.072428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.4254425 0.0028650 2591.8 <2e-16 ***
t 0.0031948 0.0000148 215.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02632 on 334 degrees of freedom
Multiple R-squared: 0.9929, Adjusted R-squared: 0.9929
F-statistic: 4.659e+04 on 1 and 334 DF, p-value: < 2.2e-16
Plot the model’s fitted values.
Let’s find the average year over year growth of this variable. Then, let’s compare this to the regression coefficient.
[1] "Regression Coefficient: 0.00319; Raw average: 0.00321"
This suggests an average increase of about 0.3% per month.
Now we need to figure out how to model this process. We can use the ACF and PACF to guide our thinking.
Estimate an AR(1) and AR(12) model.
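As a hedged sketch of how such models can be fit with base R's `arima()` (a simulated AR(1) series stands in here for the actual log-differenced, seasonally adjusted sales series):

```r
# Sketch: fit AR(1) and AR(12) models with base R's arima().
# `d` is simulated; in the slides it is the log-differenced sales series.
set.seed(2)
d <- arima.sim(model = list(ar = -0.4), n = 335)

r1 <- arima(d, order = c(1, 0, 0))   # AR(1)
r2 <- arima(d, order = c(12, 0, 0))  # AR(12)

coef(r1)["ar1"]      # estimate should be near the true value of -0.4
c(AIC(r1), AIC(r2))  # lower AIC = better fit, penalized for complexity
```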
|  | Dependent variable: |  |
|---|---|---|
|  | diff | diff |
|  | (1) | (2) |
| ar1 | -0.388*** | -0.448*** |
| (0.051) | (0.054) | |
| ar2 | -0.147** | |
| (0.059) | ||
| ar3 | -0.047 | |
| (0.060) | ||
| ar4 | 0.059 | |
| (0.059) | ||
| ar5 | 0.004 | |
| (0.059) | ||
| ar6 | -0.008 | |
| (0.060) | ||
| ar7 | 0.003 | |
| (0.060) | ||
| ar8 | -0.049 | |
| (0.060) | ||
| ar9 | -0.084 | |
| (0.060) | ||
| ar10 | -0.094 | |
| (0.060) | ||
| ar11 | -0.113* | |
| (0.059) | ||
| ar12 | -0.147*** | |
| (0.054) | ||
| intercept | 0.003*** | 0.003*** |
| (0.0005) | (0.0003) | |
| Observations | 335 | 335 |
| Log Likelihood | 1,016.674 | 1,025.347 |
| sigma2 | 0.0001 | 0.0001 |
| Akaike Inf. Crit. | -2,027.348 | -2,022.694 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |  |
liquor2 <- data.frame(date = max(liquor$date) + months(1:24),
sales = NA, sales_adj = NA,
t = max(liquor$t) + 1:24)
liquor <- rbind(liquor, liquor2)
p1 <- predict(r1, n.ahead = 24, interval = "predict")
p2 <- predict(r2, n.ahead = 24, interval = "predict")
liquor$p1 <- liquor$p2 <- liquor$sales_adj
liquor$p2_u <- liquor$p2_l <- NA
w <- which(is.na(liquor$sales_adj))
for(i in 1:24){
liquor$p1[w[i]] <- liquor$p1[w[i] - 1]*(1 + sinh(p1$pred[i]))
liquor$p2[w[i]] <- liquor$p2[w[i] - 1]*(1 + sinh(p2$pred[i]))
liquor$p2_u[w[i]] <- liquor$p2[w[i] - 1]*(1 + sinh(p2$pred[i]) + 1.645*sinh(p2$se[i]))
liquor$p2_l[w[i]] <- liquor$p2[w[i] - 1]*(1 + sinh(p2$pred[i]) - 1.645*sinh(p2$se[i]))
}
lim <- liquor$date >= ymd("2015-01-01")
plot(liquor$date[lim], liquor$p1[lim], type = "l", col = scales::alpha("tomato", .6), lwd = 3)
lines(liquor$date[lim], liquor$p2[lim], col = scales::alpha("dodgerblue", .6), lwd = 3)
lines(liquor$date[lim], liquor$sales_adj[lim], type = "l", lwd = 3)
lines(liquor$date[lim], liquor$p2_l[lim], type = "l", col = "dodgerblue")
lines(liquor$date[lim], liquor$p2_u[lim], type = "l", col = "dodgerblue")

Suppose you are a car manufacturer. Each month, you manufacture \(m\) vehicles with some error \(e_t\). \(e_t\) can be negative if some employees are on vacation. \(e_t\) can be positive if an employee brings coffee for everyone one day.
\[\text{V}_t = m + e_t\]
Each time period, some fraction (\(1-\theta\)) of vehicles are sold, and some are left over (\(\theta\)).
Therefore, your inventory for every period will look like:
\[\begin{align} \text{I}_t = V_t + \theta V_{t-1} &= (m + e_t) + \theta (m + e_{t-1}) \\ &= \mu + e_t + \theta e_{t-1} \end{align}\]
where \(\mu = (1+\theta)m\).
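A quick simulation of this inventory process (with hypothetical parameter values) shows the MA(1) signature: autocorrelation at lag 1 only.

```r
# Sketch: simulate the MA(1) inventory process I_t = mu + e_t + theta*e_{t-1},
# with mu = (1 + theta)*m. Parameter values are illustrative.
set.seed(3)
n <- 1000
m <- 100; theta <- 0.5
e <- rnorm(n)
I <- (1 + theta) * m + e + theta * c(0, head(e, -1))

rho <- acf(I, plot = FALSE)$acf
round(rho[2], 2)  # lag-1 autocorrelation, near theta / (1 + theta^2) = 0.4
round(rho[3], 2)  # lag-2 autocorrelation, near 0
```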
Maybe we can model alcohol sales as a moving average process instead of autoregressive.
|  | Dependent variable: |  |
|---|---|---|
|  | diff | diff |
|  | (1) | (2) |
| ma1 | -0.409*** | -0.466*** |
| (0.047) | (0.056) | |
| ma2 | 0.053 | |
| (0.061) | ||
| ma3 | 0.0003 | |
| (0.060) | ||
| ma4 | 0.109* | |
| (0.060) | ||
| ma5 | -0.008 | |
| (0.061) | ||
| ma6 | 0.070 | |
| (0.064) | ||
| ma7 | 0.025 | |
| (0.064) | ||
| ma8 | -0.023 | |
| (0.064) | ||
| ma9 | -0.026 | |
| (0.061) | ||
| ma10 | -0.025 | |
| (0.057) | ||
| ma11 | -0.128** | |
| (0.060) | ||
| ma12 | -0.186*** | |
| (0.064) | ||
| intercept | 0.003*** | 0.003*** |
| (0.0004) | (0.0003) | |
| Observations | 335 | 335 |
| Log Likelihood | 1,018.479 | 1,030.445 |
| sigma2 | 0.0001 | 0.0001 |
| Akaike Inf. Crit. | -2,030.958 | -2,032.890 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |  |
pm1 <- predict(m1, n.ahead = 24, interval = "predict")
pm2 <- predict(m2, n.ahead = 24, interval = "predict")
liquor$pm1 <- liquor$pm2 <- liquor$sales_adj
w <- which(is.na(liquor$sales_adj))
for(i in 1:24){
liquor$pm1[w[i]] <- liquor$pm1[w[i] - 1]*(1 + sinh(pm1$pred[i]))
liquor$pm2[w[i]] <- liquor$pm2[w[i] - 1]*(1 + sinh(pm2$pred[i]))
# liquor$pm2_u[w[i]] <- liquor$pm2[w[i] - 1]*(1 + sinh(pm2$pred[i]) + 1.645*sinh(pm2$se[i]))
# liquor$pm2_l[w[i]] <- liquor$pm2[w[i] - 1]*(1 + sinh(pm2$pred[i]) - 1.645*sinh(pm2$se[i]))
}
lim <- liquor$date >= ymd("2015-01-01")
plot(liquor$date[lim], liquor$pm1[lim], type = "l", col = scales::alpha("tomato", .6), lwd = 3)
lines(liquor$date[lim], liquor$pm2[lim], col = scales::alpha("dodgerblue", .6), lwd = 3)
lines(liquor$date[lim], liquor$sales_adj[lim], type = "l", lwd = 3)

ARMA(p, q) - a model combining AR(p) and MA(q) components.
ARIMA(p, d, q) - ARMA models that incorporate differencing.
\[\begin{align} Y_{t} = \ &\alpha + \beta_1 Y_{t-1} + \ ...\ + \beta_p Y_{t-p} + e_{t} \\ & + \theta_1 e_{t-1} + \ ... \ + \theta_{q}e_{t-q} \end{align}\]
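A hedged sketch of fitting an ARMA(1,1) with base R's `arima()`, where the `order` argument is `c(p, d, q)` with `d = 0` (simulated data):

```r
# Sketch: an ARMA(1,1) combines one AR lag and one MA lag.
set.seed(6)
y <- arima.sim(model = list(ar = 0.5, ma = 0.3), n = 400)

fit <- arima(y, order = c(1, 0, 1))  # p = 1, d = 0, q = 1
round(coef(fit), 2)                  # ar1 near 0.5, ma1 near 0.3
```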
The forecast package has some nice functions for estimating ARIMA models.
Arima() works very similarly to arima() but is slightly more flexible. We will move toward using this function.
auto.arima() uses statistical criteria to select the “best” (most likely) model. Check out this resource if you are interested.
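A minimal sketch of that workflow (assuming the `forecast` package is installed; a simulated integrated series stands in for the transformed sales data):

```r
# Sketch: auto.arima() selects (p, d, q) by AICc; forecast() then produces
# point forecasts and intervals. Assumes the `forecast` package is installed.
library(forecast)

set.seed(4)
y <- cumsum(arima.sim(model = list(ar = 0.3, ma = -0.4), n = 300))  # I(1) series

fit <- auto.arima(y)
f <- forecast(fit, h = 24)  # forecast 24 periods ahead
length(f$mean)              # 24 point forecasts, with intervals in f$upper/f$lower
```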
Let’s use auto.arima() to help us estimate liquor sales.
Series: asinh(liquor$sales_adj[!is.na(liquor$sales_adj)])
ARIMA(1,1,1) with drift
Coefficients:
ar1 ma1 drift
-0.1448 -0.2917 0.0031
s.e. 0.1220 0.1163 0.0004
sigma^2 = 0.0001347: log likelihood = 1019.16
AIC=-2030.31 AICc=-2030.19 BIC=-2015.06
f <- forecast(reg, h=24)
plot(liquor$date, liquor$sales_adj, type = "l", lwd = 2,
ylim = range(sinh(f$upper[,2]), sinh(f$lower[,2]), liquor$sales_adj[year(liquor$date) >= 2015], na.rm = TRUE),
xlim = ymd(c("2015-01-01", "2022-12-01")))
lines(liquor$date[is.na(liquor$sales_adj)], sinh(f$mean),
col = "tomato", lwd = 2)
lines(liquor$date[is.na(liquor$sales_adj)], sinh(f$upper[,2]),
col = "tomato", lwd = 2, lty = 2)
lines(liquor$date[is.na(liquor$sales_adj)], sinh(f$lower[,2]),
col = "tomato", lwd = 2, lty = 2)

(Generalized) Autoregressive Conditional Heteroskedasticity
Take stocks for example:
It is extremely difficult to model/forecast returns (Efficient Market Hypothesis vs. Behavioral Finance)
Stock prices usually follow a random walk with a drift/trend
Modeling a stock's volatility has some merit.
Daily Google Stock Returns
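As a hedged sketch (using `tseries::garch()`, assumed installed; the `rugarch` package is a more full-featured alternative), one can fit a GARCH(1,1) to simulated daily returns:

```r
# Sketch: simulate a GARCH(1,1) return series, then fit it with
# tseries::garch(). Parameter values here are purely illustrative.
library(tseries)

set.seed(5)
n <- 2000
omega <- 1e-5; alpha <- 0.10; beta <- 0.85
sig2 <- numeric(n); ret <- numeric(n)
sig2[1] <- omega / (1 - alpha - beta)  # unconditional variance
ret[1] <- sqrt(sig2[1]) * rnorm(1)
for (t in 2:n) {
  sig2[t] <- omega + alpha * ret[t - 1]^2 + beta * sig2[t - 1]
  ret[t] <- sqrt(sig2[t]) * rnorm(1)
}

g <- garch(ret, order = c(1, 1), trace = FALSE)
round(coef(g), 5)  # a0 (omega), a1 (ARCH), b1 (GARCH) estimates
```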
Here are some resources for further reading:
ECON 707/807: Econometrics II